-
Notifications
You must be signed in to change notification settings - Fork 24.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Search - add case insensitive flag for "term" family of queries #61596
Conversation
Pinging @elastic/es-search (:Search/Search) |
5b9a7bf
to
cc70b86
Compare
server/src/main/java/org/elasticsearch/common/lucene/search/CaseInsensitiveAutomatonQuery.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should add new functions rather than parameters ?
Something like termQueryCI
, prefixQueryCI
and wildcardCI
with a default implementation that throws a QueryShardException
?
cc70b86
to
c3ea3ff
Compare
I was following the approach of using parameters that we introduced in regexpQuery. |
For regexQuery I think it's different since it's only relevant for text and keyword field. Term queries are implemented by almost all field types so I think we can have at least a differentiation there ? |
ab4bdfe
to
053aee1
Compare
I'm going to copy the regexp docs for this flag:
I'll not include the flag in the example query though (despite this helping improve test coverage). The reason is that I want to avoid promoting the use of this query-time flag if an index-time choice (eg using a normalizer) is typically the better option. I'm not clear if the above is the best wording or how to present the choices between index-time and query-time case sensitivity options. It seems a pain to dive into that detailed discussion on every type of query that supports the new case insensitive flag. |
828e907
to
0539abc
Compare
I made the changes you suggested, @jimczi I feel uneasy about query docs for the flag - it would be nice to point to somewhere central that outlines the trade offs for case insensitivity at query-time Vs index-time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The runtime fields stuff seems right to me. I've added @javanna as a reviewer to make sure he gets a look at it too. We're trying to stay pretty synchronized on this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The runtime fields stuff seems right to me. I've added @javanna as a reviewer to make sure he gets a look at it too. We're trying to stay pretty synchronized on this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a naming concern, can we not use the CI acronym in the new method names? I find them hard to read, and makes it hard for readers to figure out what CI means in the context. I personally think that a slightly longer name is not a big problem.
server/src/main/java/org/elasticsearch/index/mapper/ConstantFieldType.java
Outdated
Show resolved
Hide resolved
29d895a
to
b82974a
Compare
Thanks for the review, @javanna . I changed the name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments, thanks for adding the distinction for term queries
server/src/main/java/org/elasticsearch/common/lucene/search/CaseInsensitiveAutomatonQuery.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/common/lucene/search/CaseInsensitivePrefixQuery.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/common/lucene/search/CaseInsensitiveTermQuery.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/common/lucene/search/CaseInsensitiveWildcardQuery.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/IndexFieldMapper.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/elasticsearch/xpack/runtimefields/mapper/ScriptBooleanMappedFieldType.java
Outdated
Show resolved
Hide resolved
...s/src/main/java/org/elasticsearch/xpack/runtimefields/query/StringScriptFieldTermsQuery.java
Outdated
Show resolved
Hide resolved
...s/src/main/java/org/elasticsearch/xpack/runtimefields/query/StringScriptFieldTermsQuery.java
Outdated
Show resolved
Hide resolved
|
||
|
||
// Case insensitive form of terms query | ||
public Query termsQueryCaseInsensitive(List<?> values, @Nullable QueryShardContext context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder what we should do here ? We'd need to add the support for case insensitive matching in TermInSetQuery
if we want to avoid building giant boolean query in the keyword
field. For now I think it's ok to not provide this option on terms
query since we cannot support them efficiently ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jimczi terms
query is currently used by eql for more obvious queries like file where file_name in ("wininit.exe", "lsass.exe")
and (less obvious) for cidrMatch
function for matching an IP address in a list of CIDR blocks. Would be useful to have case insensitive support for it as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(less obvious) for
cidrMatch
function for matching an IP address in a list of CIDR blocks.
Neat! I just want to make sure you aren't expanding the cidr blocks or anything - we natively support the cidr match in the term
and terms
query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just want to make sure you aren't expanding the cidr blocks or anything - we natively support the cidr match in the term and terms query.
We're not, we just pass the block to ES.
When dealing with an expression that forces us to do scripting, we use the underlying Lucene class for matching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice.
Hopefully you'll be able to use runtime field before too long!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I think it's ok to not provide this option on terms query since we cannot support them efficiently ?
If we don't support it won't people just be forced to do what we do internally and create a bool should
array of term
queries?
Is there a query-complexity circuit-breaker we're somehow missing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a +1 to have this supported in terms
query. The alternative in EQL, in case we decide to not support this, would probably be (as @markharwood mentioned) to create a bool
query with a bunch of term
queries in it. CC @jimczi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My worry is that we'd internally create an array of should term
queries because the TermsInSet
query doesn't handle case insensitive queries. It's maybe not a big issue but that wouldn't be consistent since a case insensitive terms query would fail with more than 1024 terms (the max boolean clause limit) while the normal one would use the optimized TermsInSet
query. So my take on this is that we should implement a proper support in TermInSet
if we really want to handle terms
case insensitive query. For EQL and SQL I don't think there's a real need to handle terms
though. The in
operator doesn't need to handle thousands of terms so imo that would be acceptable to translate it into an array of term
query. Same for the cidrMatch
function that also work on term
queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - I'll pull the termsQueryCaseInsensitive
method for now. We can always add in another PR later. This PR probably has enough changes to consider already
0c80418
to
698b62a
Compare
I ran some benchmarks to see what effects this flag has on response times. Response Time (secs) for 1,000 queries on different fields in 2m doc index
Some of the observations this helped me draw out:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks good to me. Thanks for iterating on this. I'd prefer that we consider terms
queries in a follow up since the current matching on keyword would not be able to leverage the TermsInSetQuery
but I don't have strong feelings either. We can also optimize later so I let you decide on that aspect ;).
return simpleMatchWithNormalizedStrings(pattern, str); | ||
} | ||
|
||
private static boolean simpleMatchWithNormalizedStrings(String pattern, String str) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simpleMatchCaseInsensitive ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That name might suggest the case insensitivity is built into the method (it's not).
I wanted something less explicit ("post-normalization"?) that suggested the strings were assumed to have already been normalised (which for argument's sake could have been uppercasing not lowercasing).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair enough
de8c3d2
to
8f9ad73
Compare
…, prefix, wildcard queries Closes elastic#61546
8f9ad73
to
685a419
Compare
test this please |
Removed |
In server/src/main/java/org/elasticsearch/common/lucene/search/AutomatonQueries.java toCaseInsensitiveChar function, for now it only works with ASCII characters. May I know why not support foreign characters like Vietnamese? It is not consist with keyword. |
@jypan0115 your reading of the code is correct, as the comment in the code points out, the "insensitive option" currently only works for ASCII characters. Your you mind opening an enhancement issue pointing out where this fails e.g. for Vietnamese? I'm not sure how casing is handled in that language, but having an example will help us determine the implementation effort and prioritize this. |
@cbuescher I opened an issue about this https://github.com/elastic/elasticsearch/issues/95120 And here is a fail example, we have a field called name. If we store it as keyword field and use case insensitive term query to search for Ngô Đức(Uppercase) or ngô đức(lowercase), it works fine. But if it is stored as wildcard field, it failed when I use case insensitive term query. This can be fixed by removing these 3 lines in server/src/main/java/org/elasticsearch/common/lucene/search/AutomatonQueries.java toCaseInsensitiveChar function.
And that's why I am asking why we only support ASCII characters in wildcard field. Hope this helps. |
Thanks, I'll copy parts of this short note over to the issue. |
Adds case insensitive option to term, terms, terms_in_set, prefix, wildcard queries
First cut.
Closes #61546